Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization

Islam, Md Moinul, Kakouros, Sofoklis, Heikkilä, Janne, Oussalah, Mourad

arXiv.org Artificial Intelligence

The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues, and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words: terms emphasized across multiple modalities that are used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground-truth (pGT) summaries generated using an LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive methods, such as the Edmundson method, in both text- and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries.
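The cross-modal agreement behind "bonus words" lends itself to a compact illustration. Below is a minimal sketch, not the authors' implementation: it assumes per-word emphasis scores for each modality have already been computed upstream (by prosody, text-salience, and facial-cue models), and the function name, thresholds, and voting rule are all assumptions.

```python
# Hypothetical sketch of the "bonus word" idea: a word counts as emphasized
# by a modality if its score exceeds that modality's threshold, and becomes
# a bonus word when two or more modalities agree. Score extraction
# (prosody, text salience, facial cues) is assumed to happen upstream.

def bonus_words(words, prosody, text_salience, visual, thresholds=(0.7, 0.7, 0.7)):
    """words: list of tokens; prosody/text_salience/visual: dicts token -> [0,1] score."""
    bonus = []
    for w in words:
        votes = sum(
            scores.get(w, 0.0) >= t
            for scores, t in zip((prosody, text_salience, visual), thresholds)
        )
        if votes >= 2:  # emphasized across multiple modalities
            bonus.append(w)
    return bonus

# Example: "deadline" is both prosodically stressed and visually salient.
print(bonus_words(
    ["the", "deadline", "moved"],
    prosody={"deadline": 0.9},
    text_salience={"deadline": 0.4, "moved": 0.8},
    visual={"deadline": 0.8},
))  # -> ['deadline']
```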


Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification

Guo, Chenqi, Rong, Mengshuo, Feng, Qianli, Feng, Rongfan, Ma, Yinglong

arXiv.org Artificial Intelligence

Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve knowledge transfer. In supervised image classification, image datasets typically include class labels that represent high-level concepts, suggesting a natural avenue to incorporate textual cues for crossmodal KD. However, these labels rarely capture the deeper semantic structures in real-world visuals and can lead to label leakage if used directly as inputs, ultimately limiting KD performance. To address these issues, we propose a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. By avoiding direct use of exact class names and instead using semantically richer WordNet expansions, we mitigate label leakage and introduce more diverse textual cues. Experiments show that this strategy significantly boosts student performance, whereas noisy or overly precise text embeddings hinder distillation efficiency. Interpretability analyses confirm that WordNet-relaxed prompts encourage heavier reliance on visual features over textual shortcuts, while still effectively incorporating the newly introduced textual cues. Our method achieves state-of-the-art or second-best results on six public datasets, demonstrating its effectiveness in advancing crossmodal KD.
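The WordNet relaxation step can be illustrated with NLTK's WordNet interface. The sketch below is an approximation, not the paper's code: it replaces an exact class name with hypernym and synonym lemmas and wraps them in a CLIP-style prompt template (the template and term limit are invented here).

```python
# Illustrative "WordNet-relaxed" prompts: swap the exact class name for
# hypernyms and sibling lemmas so the text input cannot leak the label
# directly, while still carrying related semantics.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def relaxed_prompts(class_name, max_terms=5):
    terms = []
    for syn in wn.synsets(class_name, pos=wn.NOUN):
        for hyper in syn.hypernyms():                 # broader concepts
            terms.extend(hyper.lemma_names())
        terms.extend(l for l in syn.lemma_names() if l != class_name)  # synonyms
    seen, out = set(), []
    for t in terms:
        t = t.replace("_", " ")
        if t not in seen and t != class_name:
            seen.add(t)
            out.append(t)
    return [f"a photo of a {t}" for t in out[:max_terms]]

print(relaxed_prompts("dog"))
# e.g. ['a photo of a canine', 'a photo of a domestic animal', ...]
```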


Human-Inspired Long-Term Indoor Localization in Human-Oriented Environment

Zimmerman, Nicky, Sodano, Matteo

arXiv.org Artificial Intelligence

Inspired by how humans navigate, we can exploit insights from human navigation to improve long-term localization, which enables robots to navigate in the same environment over extended periods, spanning several months or even years. In this work, we summarize our past contributions to robust long-term localization. There is a trade-off between accuracy and robustness, and each task requires a different blend of the two: for example, when planning and navigating along a path of hundreds of meters, robustness (i.e., avoiding jumps in the trajectory) is more important, while high accuracy is only required at specific end-points.


Beyond Trend and Periodicity: Guiding Time Series Forecasting with Textual Cues

Xu, Zhijian, Bian, Yuxuan, Zhong, Jianyuan, Wen, Xiangyu, Xu, Qiang

arXiv.org Artificial Intelligence

This work introduces a novel Text-Guided Time Series Forecasting (TGTSF) task. By integrating textual cues, such as channel descriptions and dynamic news, TGTSF addresses the critical limitations of traditional methods that rely purely on historical data. To support this task, we propose TGForecaster, a robust baseline model that fuses textual cues and time series data using cross-attention mechanisms. We then present four meticulously curated benchmark datasets to validate the proposed framework, ranging from simple periodic data to complex, event-driven fluctuations. Our comprehensive evaluations demonstrate that TGForecaster consistently achieves state-of-the-art performance, highlighting the transformative potential of incorporating textual information into time series forecasting. This work not only pioneers a novel forecasting task but also establishes a new benchmark for future research, driving advancements in multimodal data integration for time series models.
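As a rough illustration of the fusion idea, here is a minimal PyTorch cross-attention block in the spirit of TGForecaster, where time-series tokens attend to text embeddings; all dimensions, the pooling, and the forecasting head are assumptions rather than the paper's architecture.

```python
# Sketch of text-guided fusion: time-series patch embeddings (queries)
# attend to text embeddings such as channel descriptions or news (keys/values).
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    def __init__(self, d_model=128, n_heads=4, horizon=24):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, horizon)

    def forward(self, ts_tokens, text_tokens):
        # ts_tokens: (B, T, d); text_tokens: (B, N, d) from a frozen text encoder
        fused, _ = self.attn(query=ts_tokens, key=text_tokens, value=text_tokens)
        fused = self.norm(ts_tokens + fused)   # residual fusion
        return self.head(fused.mean(dim=1))    # (B, horizon) forecast

model = TextGuidedFusion()
y = model(torch.randn(2, 96, 128), torch.randn(2, 8, 128))
print(y.shape)  # torch.Size([2, 24])
```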


Reliability Analysis of Psychological Concept Extraction and Classification in User-penned Text

Garg, Muskan, Sathvik, MSVPJ, Chadha, Amrit, Raza, Shaina, Sohn, Sunghwan

arXiv.org Artificial Intelligence

The social NLP research community has witnessed a recent surge in computational advancements in mental health analysis, building responsible AI models for the complex interplay between language use and self-perception. Such responsible AI models aid in quantifying psychological concepts from user-penned texts on social media. Thinking beyond the low-level (classification) task, we advance the existing binary classification dataset towards the higher-level task of reliability analysis through the lens of explanations, posing it as one of the safety measures. We annotate the LoST dataset to capture nuanced textual cues that suggest the presence of low self-esteem in the posts of Reddit users. We further observe that NLP models developed for determining the presence of low self-esteem focus on three types of textual cues: (i) Trigger: words that trigger mental disturbance, (ii) LoST indicators: textual indicators emphasizing low self-esteem, and (iii) Consequences: words describing the consequences of mental disturbance. We implement existing classifiers to examine the attention mechanism in pre-trained language models (PLMs) for a domain-specific, psychology-grounded task. Our findings suggest the need to shift the focus of PLMs from Triggers and Consequences to a more comprehensive explanation that emphasizes LoST indicators when determining low self-esteem in Reddit posts.
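One way to make the attention claim concrete is to probe where a PLM's [CLS] attention lands relative to annotated cue spans. The sketch below is a hypothetical probe, not the paper's protocol: the model checkpoint, example post, and span annotations are all placeholders.

```python
# Hypothetical probe: how much last-layer [CLS] attention falls on annotated
# Trigger / LoST-indicator / Consequence spans of a post.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

post = "i failed again and i feel worthless and now i cant sleep"
spans = {"Trigger": "failed", "LoST": "worthless", "Consequence": "cant sleep"}

enc = tok(post, return_tensors="pt")
with torch.no_grad():
    att = model(**enc).attentions[-1]   # last layer: (1, heads, seq, seq)
cls_att = att.mean(dim=1)[0, 0]         # [CLS] row, averaged over heads

tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
for cue, phrase in spans.items():
    pieces = set(tok.tokenize(phrase))
    mass = sum(a.item() for t, a in zip(tokens, cls_att) if t in pieces)
    print(f"{cue}: attention mass {mass:.3f}")
```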


InterPrompt: Interpretable Prompting for Interrelated Interpersonal Risk Factors in Reddit Posts

Sathvik, MSVPJ, Sarkar, Surjodeep, Saxena, Chandni, Sohn, Sunghwan, Garg, Muskan

arXiv.org Artificial Intelligence

Mental health professionals and clinicians have observed an upsurge of mental disorders due to Interpersonal Risk Factors (IRFs). To simulate the human-in-the-loop triaging scenario for early detection of mental health disorders, we recognize textual indications that ascertain these IRFs, Thwarted Belongingness (TBe) and Perceived Burdensomeness (PBu), within personal narratives. In light of this, we use N-shot learning with the GPT-3 model on the IRF dataset and underscore the importance of fine-tuning GPT-3 to incorporate context-specific sensitivity and the interconnectedness of textual cues that represent both IRFs. In this paper, we introduce an Interpretable Prompting (InterPrompt) method to boost the attention mechanism by fine-tuning the GPT-3 model, which allows a more sophisticated level of language modification by adjusting the pre-trained weights. Our model learns to detect common patterns and underlying connections across both IRFs, which leads to better system-level explainability and trustworthiness. The results of our research demonstrate that all four variants of the GPT-3 model, when fine-tuned with InterPrompt, perform considerably better than the baseline methods, both in terms of classification and explanation generation.
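The N-shot setup can be pictured as simple prompt assembly. The snippet below is an illustrative sketch only; the instruction wording, example narratives, and labels are invented, not drawn from the IRF dataset or the paper's prompts.

```python
# Hypothetical N-shot prompt construction: a few labelled narratives
# followed by the target post to be labelled for both IRFs.
EXAMPLES = [
    ("Nobody would notice if I was gone.", "TBe: yes | PBu: yes"),
    ("My friends checked in on me today.", "TBe: no | PBu: no"),
]

def build_nshot_prompt(post, examples=EXAMPLES):
    lines = ["Label each post for Thwarted Belongingness (TBe) and "
             "Perceived Burdensomeness (PBu), and quote the textual cues."]
    for text, label in examples:
        lines.append(f"Post: {text}\nLabels: {label}")
    lines.append(f"Post: {post}\nLabels:")
    return "\n\n".join(lines)

print(build_nshot_prompt("I am just a burden to everyone around me."))
```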


Empowering Fake-News Mitigation: Insights from Sharers' Social Media Post-Histories

Schoenmueller, Verena, Blanchard, Simon J., Johar, Gita V.

arXiv.org Artificial Intelligence

Misinformation is a global concern, and limiting its spread is critical for protecting democracy, public health, and consumers. We propose that consumers' own social media post-histories are an underutilized data source for studying what leads them to share links to fake news. In Study 1, we explore how textual cues extracted from post-histories distinguish fake-news sharers from random social media users and others in the misinformation ecosystem. Among other results, we find across two datasets that fake-news sharers use more words related to anger, religion, and power. In Study 2, we show that adding textual cues from post-histories improves the accuracy of models that predict who is likely to share fake news. In Study 3, we provide a preliminary test of two mitigation strategies deduced from Study 1, activating religious values and reducing anger, and find that they reduce fake-news sharing and sharing more generally. In Study 4, we combine survey responses with users' verified Twitter post-histories and show that using empowering language in a fact-checking browser-extension ad increases download intentions. Our research encourages marketers, misinformation scholars, and practitioners to use post-histories to develop theories and test interventions to reduce the spread of misinformation.
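Study 2's feature idea, turning post-histories into word-category rates, can be sketched as follows. The tiny lexicons, toy histories, and classifier choice are stand-ins (a real analysis would use a validated dictionary such as LIWC), so treat this as the shape of the pipeline rather than the authors' code.

```python
# Toy pipeline: word-category rates (anger, religion, power) from a user's
# post-history as predictors of fake-news sharing.
from sklearn.linear_model import LogisticRegression

LEXICON = {
    "anger": {"hate", "furious", "outrage"},
    "religion": {"god", "faith", "pray"},
    "power": {"control", "force", "dominate"},
}

def category_rates(history):
    tokens = " ".join(history).lower().split()
    n = max(len(tokens), 1)
    return [sum(t in words for t in tokens) / n for words in LEXICON.values()]

X = [category_rates(["I hate this outrage god help us"]),        # sharer-like
     category_rates(["lovely weather and a quiet walk today"])]  # random user
y = [1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([category_rates(["pray they dominate and control it all"])]))
```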


AidUI: Toward Automated Recognition of Dark Patterns in User Interfaces

Mansur, SM Hasan, Salma, Sabiha, Awofisayo, Damilola, Moran, Kevin

arXiv.org Artificial Intelligence

Past studies have illustrated the prevalence of UI dark patterns, or user interfaces that can lead end-users toward (unknowingly) taking actions that they may not have intended. Such deceptive UI designs can result in adverse effects on end users, such as oversharing personal information or financial loss. While significant research progress has been made toward the development of dark pattern taxonomies, developers and users currently lack guidance to help recognize, avoid, and navigate these often subtle design motifs. However, automated recognition of dark patterns is a challenging task, as the instantiation of a single type of pattern can take many forms, leading to significant variability. In this paper, we take the first step toward understanding the extent to which common UI dark patterns can be automatically recognized in modern software applications. To do this, we introduce AidUI, a novel automated approach that uses computer vision and natural language processing techniques to recognize a set of visual and textual cues in application screenshots that signify the presence of ten unique UI dark patterns, allowing for their detection, classification, and localization. To evaluate our approach, we have constructed ContextDP, currently the largest dataset of fully-localized UI dark patterns, spanning 175 mobile and 83 web UI screenshots that contain 301 dark pattern instances. The results of our evaluation illustrate that AidUI achieves an overall precision of 0.66, recall of 0.67, and F1-score of 0.65 in detecting dark pattern instances, reports few false positives, and is able to localize detected patterns with an IoU score of ~0.84. Furthermore, a significant subset of the studied dark patterns can be detected quite reliably (F1-score of over 0.82), and future research directions may allow for improved detection of additional patterns.
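Localization quality is reported as an IoU score, which is straightforward to compute over bounding boxes. This is the standard metric rather than anything specific to AidUI, and the example boxes below are made up.

```python
# Intersection-over-union (IoU) between a predicted and a ground-truth
# dark-pattern region, as (x1, y1, x2, y2) pixel-coordinate boxes.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

predicted = (10, 10, 110, 60)     # detected "countdown timer" banner
ground_truth = (12, 8, 115, 58)
print(f"IoU = {iou(predicted, ground_truth):.2f}")  # high overlap
```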


Shirani

AAAI Conferences

Type information plays an important role in the success of information retrieval and recommendation systems in software engineering. Thus, the absence of types in dynamically-typed languages poses a challenge when adapting these systems to support dynamic languages. In this paper, we explore the viability of type inference using textual cues. That is, we formulate type inference as a classification problem that uses textual features in the source code to predict the types of variables. In this approach, a classifier learns to distinguish between the types of variables in a program; the learned model is subsequently used to (approximately) infer the types of other variables. We evaluate the feasibility of this approach on four Java projects wherein type information is already available in the source code and can be used to train and test a classifier. Our experiments show this approach can predict the types of new variables with relatively high accuracy (80% F-measure). These results suggest that textual cues can be complementary tools in inferring types for dynamic languages.
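The formulation is easy to mock up: treat each variable's surrounding tokens as a document and its declared type as the label. The sketch below uses TF-IDF features and logistic regression as stand-ins (the abstract does not specify these choices), with invented training pairs in place of real typed Java source.

```python
# Toy type-inference-as-classification: textual context -> variable type.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

samples = [
    ("customer name label text", "String"),
    ("total count index size", "int"),
    ("is visible enabled flag", "boolean"),
    ("first name title message", "String"),
    ("retry count offset length", "int"),
    ("has children active flag", "boolean"),
]
X, y = zip(*samples)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict(["error message text"]))  # likely 'String'
```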


Learning to Reason with Relational Video Representation for Question Answering

Le, Thao Minh, Le, Vuong, Venkatesh, Svetha, Tran, Truyen

arXiv.org Artificial Intelligence

While acquiring visual knowledge of objects and relations from static images has advanced hugely in recent years [7], deep video understanding remains elusive. Compared to static images, video poses new challenges, primarily due to the inherent dynamic nature of visual content over time [6, 34]. At the lowest level, we have correlated motion and appearance [6]. At a higher level, we have objects that are persistent over time, actions that are local in time, and relations that can span an extended length. Thus searching for an answer in a video entails solving simultaneous sub-tasks in both the visual and lingual spaces, probably in an iterative and compositional fashion. How does a machine learn to reason about the content of a video in answering a question? A Video QA system must simultaneously understand language, represent visual content over space-time, iteratively transform these representations in response to the lingual content of the query, and finally arrive at a sensible answer. While recent advances in textual and visual question answering have come up with sophisticated visual representations and neural reasoning mechanisms, major challenges in Video QA remain in the dynamic grounding of concepts, relations, and actions.